train <- read.csv("train.csv", header = TRUE)
Let X = GrLivArea
x = \(Q_4\) (4th quartile) of X
Y = SalePrice
y = \(Q_2\) (2nd quartile) of Y
df <- data.frame(X = train$GrLivArea, Y = train$SalePrice)
quantile(df$X, c(0, 0.25, 0.5, 0.75, 1))
## 0% 25% 50% 75% 100%
## 334.00 1129.50 1464.00 1776.75 5642.00
quantile(df$Y, c(0, 0.25, 0.5, 0.75, 1))
## 0% 25% 50% 75% 100%
## 34900 129975 163000 214000 755000
pdf <- function(var) {
approxfun(density(var))
}
cdf <- function(samp, val) {
return(integrate(pdf(samp), min(samp), min(val, max(samp)))[1]$value)
}
hist(df$X, probability = TRUE,
ylim = c(0, max(density(df$X)$y)))
lines(density(df$X))
plot(ecdf(df$X))
a <- seq(min(df$X), max(df$X), (max(df$X) - min(df$X)) / 100)
plot(a, sapply(a, function(z) cdf(df$X, z)), type = "l")
hist(df$Y, probability = TRUE,
ylim = c(0 , max(density(df$Y)$y)))
lines(density(df$Y))
plot(ecdf(df$Y))
b <- seq(min(df$Y), max(df$Y), (max(df$Y) - min(df$Y)) / 100)
plot(b, sapply(b, function(z) cdf(df$Y, z)), type = "l")
(pr_A <- (nrow(df[df$X > max(df$X) & df$Y > median(df$Y), ]) / nrow(df)) /
(nrow(df[df$Y > median(df$Y), ]) / nrow(df)))
## [1] 0
(pr_B <- nrow(df[df$X > max(df$X) & df$Y > median(df$Y), ]) / nrow(df))
## [1] 0
(pr_C <- (nrow(df[df$X < max(df$X) & df$Y > median(df$Y), ]) / nrow(df)) /
(nrow(df[df$Y > median(df$Y), ]) / nrow(df)))
## [1] 1
a. \(P(X > x | Y > y) = P(X > x \cap Y > y) / P(Y > y) = P(X > 5642 \cap Y > 163000) / P(Y > 163000) = (0 / 1460) / (728 / 1460) = 0\)
This is the probability that X or GrLivArea, the above grade (ground) living area in square feet, is greater than the fourth quartile or 100th percentile of that variable conditioned on the event that Y or SalePrice, the property’s sale price in dollars, is greater than the second quartile or median value of that variable.
b. \(P(X > x, Y > y) = P(X > 5642 \cap Y > 163000) = 0 / 1460 = 0\)
This is the joint probability that a property’s GrLivArea is greater than the fourth quartile of that variable and its SalePrice is greater than the second quartile of that variable.
c. \(P(X < x | Y > y) = P(X < x \cap Y > y) / P(Y > y) = P(X < 5642 \cap Y > 163000) / P(Y > 163000) = (728 / 1460) / (728 / 1460) = 1\)
This is the conditional probability that GrLivArea is less than the fourth quartile of that variable given that SalePrice is greater than the second quartile of that variable.
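The three quantities above are empirical proportions, so each can be written as a mean of logical vectors. The following is a minimal sketch on simulated data, not the housing data; `x_sim`, `y_sim` and the helper `cond_prob` are hypothetical names introduced only for illustration.

```r
set.seed(123)                                     # arbitrary seed, for reproducibility only
x_sim <- rlnorm(500, meanlog = 7, sdlog = 0.3)    # stand-in for X
y_sim <- 100 * x_sim * rlnorm(500, sdlog = 0.2)   # stand-in for Y, tied to X

# Empirical P(A | B) = P(A and B) / P(B), with A and B as logical vectors
cond_prob <- function(a, b) mean(a & b) / mean(b)

b <- y_sim > median(y_sim)

p_a <- cond_prob(x_sim > max(x_sim), b)   # nothing exceeds the maximum, so 0
p_b <- mean(x_sim > max(x_sim) & b)       # joint probability, also 0
p_c <- cond_prob(x_sim < max(x_sim), b)   # strict "<" leaves out the maximum itself

# The same conditional probability, computed by subsetting on the condition
p_c_alt <- mean((x_sim < max(x_sim))[b])
```

Because the threshold is the sample maximum, `p_a` and `p_b` are zero by construction, whatever the data.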
(cond_pr1 <- (nrow(df[df$X > max(df$X) & df$Y > median(df$Y), ]) / nrow(df)) /
(nrow(df[df$Y > median(df$Y), ]) / nrow(df)))
## [1] 0
(indep_pr1 <- (nrow(df[df$X > max(df$X), ]) / nrow(df)))
## [1] 0
cond_pr1 == indep_pr1
## [1] TRUE
(cond_pr2 <-
(nrow(df[df$X > quantile(df$X, 0.75) & df$Y > median(df$Y), ]) / nrow(df)) /
(nrow(df[df$Y > median(df$Y), ]) / nrow(df)))
## [1] 0.4326923
(indep_pr2 <- (nrow(df[df$X > quantile(df$X, 0.75), ]) / nrow(df)))
## [1] 0.25
cond_pr2 == indep_pr2
## [1] FALSE
(cond_pr3 <- (nrow(df[df$X > median(df$X) & df$Y > median(df$Y), ]) / nrow(df)) /
(nrow(df[df$Y > median(df$Y), ]) / nrow(df)))
## [1] 0.7884615
(indep_pr3 <- (nrow(df[df$X > median(df$X), ]) / nrow(df)))
## [1] 0.4993151
cond_pr3 == indep_pr3
## [1] FALSE
(t1 <- table(df$X > max(df$X), df$Y > median(df$Y)))
##
## FALSE TRUE
## FALSE 732 728
chisq.test(t1)
##
## Chi-squared test for given probabilities
##
## data: t1
## X-squared = 0.010959, df = 1, p-value = 0.9166
(t2 <- table(df$X > quantile(df$X, 0.75), df$Y > median(df$Y)))
##
## FALSE TRUE
## FALSE 682 413
## TRUE 50 315
chisq.test(t2)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t2
## X-squared = 256.53, df = 1, p-value < 2.2e-16
(t3 <- table(df$X > median(df$X), df$Y > median(df$Y)))
##
## FALSE TRUE
## FALSE 577 154
## TRUE 155 574
chisq.test(t3)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t3
## X-squared = 483.29, df = 1, p-value < 2.2e-16
(t4 <- table(ceiling((ecdf(df$X)(df$X) / 0.25)), ceiling((ecdf(df$Y)(df$Y) / 0.25))))
##
## 1 2 3 4
## 1 224 133 8 0
## 2 96 122 133 13
## 3 31 73 149 113
## 4 14 35 75 241
chisq.test(t4)
##
## Pearson's Chi-squared test
##
## data: t4
## X-squared = 908.28, df = 9, p-value < 2.2e-16
Above, I examine the independence of the variables \(X\) and \(Y\) by comparing the conditional probability \(P(X > x | Y > y)\) with the unconditional probability \(P(X > x)\) for three thresholds, \(x \in \{Q_4(X), Q_3(X), Q_2(X)\}\), with \(y = Q_2(Y)\) throughout. In other words, I compare the probability that GrLivArea exceeds its fourth quartile (maximum), third quartile, or median, given that SalePrice exceeds the median sale price, with the corresponding unconditional probability that GrLivArea exceeds the same threshold. If the two variables were independent, \(P(X > x | Y > y)\) would equal \(P(X > x)\), because the event \(Y > y\) would carry no information about the likelihood of \(X\) exceeding the threshold. Here, the two probabilities are equal only for \(x = Q_4(X)\), and only trivially: no value of \(X\) exceeds its maximum, so both are zero. For the other two thresholds the conditional probability is well above the unconditional one (0.43 versus 0.25 at the third quartile, and 0.79 versus 0.50 at the median), which indicates that \(X\) and \(Y\) are not independent. Because sample proportions are rarely exactly equal even for independent variables, this comparison is only suggestive on its own; the chi-squared tests provide the formal check.
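The comparison at the third quartile can also be read directly off the contingency table t2 printed earlier. This sketch rebuilds that table from its printed counts and recovers the two probabilities; the object name `t2_counts` and the dimension labels are introduced here only for illustration.

```r
# Counts copied from the printed table t2:
# rows = X > Q3(X), columns = Y > median(Y)
t2_counts <- matrix(c(682, 50, 413, 315), nrow = 2,
                    dimnames = list(X_gt_Q3  = c("FALSE", "TRUE"),
                                    Y_gt_med = c("FALSE", "TRUE")))

# P(X > x | Y > y): share of the Y > y column that also has X > x
cond_pr <- t2_counts["TRUE", "TRUE"] / sum(t2_counts[, "TRUE"])

# P(X > x): share of all observations with X > x
marg_pr <- sum(t2_counts["TRUE", ]) / sum(t2_counts)

c(conditional = cond_pr, marginal = marg_pr)
```

These are 315/728 and 365/1460, the same ratios computed as cond_pr2 and indep_pr2 above.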
Chi-squared tests on the two-way contingency tables of \(X > x\) and \(Y > y\) for the same thresholds, and on the contingency table obtained by binning each variable at its quartile boundaries, confirm an association between the two variables. The exception is the first test, on t1 with \(x = Q_4(X)\): because there are no cases in which \(X > x\), t1 has a single row holding the counts of \(Y \leq y \cap X \leq x\) and \(Y > y \cap X \leq x\), and chisq.test falls back to a goodness-of-fit test for given probabilities rather than a test of independence. All of the other tests yield p-values below 0.05, so we can reject the null hypothesis that the two variables are independent.
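To make the test statistic less of a black box, the sketch below recomputes the Yates-corrected Pearson statistic for t3 by hand from the printed counts and cross-checks it against chisq.test. The names `t3_counts`, `expected`, `stat` and `builtin` are introduced here for illustration only.

```r
# Counts copied from the printed table t3:
# rows = X > median(X), columns = Y > median(Y)
t3_counts <- matrix(c(577, 155, 154, 574), nrow = 2)

# Expected counts if rows and columns were independent
expected <- outer(rowSums(t3_counts), colSums(t3_counts)) / sum(t3_counts)

# Pearson statistic with Yates' continuity correction (used for 2 x 2 tables)
stat <- sum((abs(t3_counts - expected) - 0.5)^2 / expected)
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)

# Cross-check against the built-in test
builtin <- chisq.test(t3_counts)
all.equal(stat, unname(builtin$statistic))
```

The large statistic comes from the gap between the observed diagonal counts (577 and 574) and the roughly 365 expected in each cell under independence.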
summary(df$X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
var(df$X)
## [1] 276129.6
sd(df$X)
## [1] 525.4804
hist(df$X)
boxplot(df$X)
summary(df$Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 130000 163000 180900 214000 755000
var(df$Y)
## [1] 6311111264
sd(df$Y)
## [1] 79442.5
hist(df$Y)
boxplot(df$Y)
plot(df)
qqnorm(lm(Y ~ X, df)$residuals)
qqline(lm(Y ~ X, df)$residuals)
library(MASS)
## transform X
bc <- boxcox(X ~ 1, data = df, lambda = seq(-2, 2, len = 1000))
## 95% CI for lambda
range(bc$x[bc$y > max(bc$y) - 1/2 * qchisq(0.95,1)])
## [1] -0.1101101 0.1221221
lambda_X <- bc$x[which.max(bc$y)]
df$X_bc <- (df$X^lambda_X - 1) / lambda_X
## transform Y
bc <- boxcox(Y ~ 1, data = df, lambda = seq(-2, 2, len = 1000))
## 95% CI for lambda
range(bc$x[bc$y > max(bc$y) - 1/2 * qchisq(0.95,1)])
## [1] -0.16616617 0.01401401
lambda_Y <- bc$x[which.max(bc$y)]
df$Y_bc <- (df$Y^lambda_Y - 1) / lambda_Y
plot(df$X_bc, df$Y_bc)
qqnorm(lm(Y_bc ~ X_bc, df)$residuals)
qqline(lm(Y_bc ~ X_bc, df)$residuals)
plot(log(df$X), log(df$Y))
qqnorm(lm(log(Y) ~ log(X), df)$residuals)
qqline(lm(log(Y) ~ log(X), df)$residuals)
library(psychometric)
## Loading required package: multilevel
## Loading required package: nlme
(r_bc <- cor(df$X_bc, df$Y_bc))
## [1] 0.7293698
z_r_bc <- 0.5 * log((1 + r_bc)/(1 - r_bc))
se_r <- 1 / sqrt(nrow(df) - 3)
(CIr_bc <- data.frame(lower = (exp(2 * (z_r_bc - qnorm(0.995) * se_r)) - 1) /
(exp(2 * (z_r_bc - qnorm(0.995) * se_r)) + 1),
upper = (exp(2 * (z_r_bc + qnorm(0.995) * se_r)) - 1) /
(exp(2 * (z_r_bc + qnorm(0.995) * se_r)) + 1)
))
## lower upper
## 1 0.6962049 0.7594276
CIr(r_bc, nrow(df), level = 0.99)
## [1] 0.6962049 0.7594276
(r_ln <- cor(log(df$X), log(df$Y)))
## [1] 0.7302549
z_r_ln <- 0.5 * log((1 + r_ln)/(1 - r_ln))
se_r <- 1 / sqrt(nrow(df) - 3)
(CIr_ln <- data.frame(lower = (exp(2 * (z_r_ln - qnorm(0.995) * se_r)) - 1) /
(exp(2 * (z_r_ln - qnorm(0.995) * se_r)) + 1),
upper = (exp(2 * (z_r_ln + qnorm(0.995) * se_r)) - 1) /
(exp(2 * (z_r_ln + qnorm(0.995) * se_r)) + 1)
))
## lower upper
## 1 0.6971794 0.760228
CIr(r_ln, nrow(df), level = 0.99)
## [1] 0.6971794 0.7602280
## permutation test on Box-Cox transformed variables
cor_coefs <- vector("numeric", 10000)
for (i in 1:10000) {
Y_prime <- sample(df$Y_bc, length(df$Y_bc), replace = FALSE)
cor_coefs[i] <- cor(df$X_bc, Y_prime)
}
head(sort(round(cor_coefs, digits = 3), decreasing = TRUE))
## [1] 0.105 0.091 0.086 0.082 0.082 0.081
(p_val <- sum(abs(cor_coefs) > abs(r_bc)) / length(cor_coefs))
## [1] 0
## 99% CI - bootstrap method on Box-Cox transformed variables
cor_coefs <- vector("numeric", 10000)
for (i in 1:10000) {
rows <- sample(1:nrow(df), nrow(df), replace = TRUE)
cor_coefs[i] <- cor(df[rows, ]$X_bc, df[rows, ]$Y_bc)
}
quantile(cor_coefs, c(0.005, 0.995))
## 0.5% 99.5%
## 0.6929419 0.7638510
## permutation test on log transformed variables
cor_coefs <- vector("numeric", 10000)
for (i in 1:10000) {
Y_prime <- sample(log(df$Y), length(log(df$Y)), replace = FALSE)
cor_coefs[i] <- cor(log(df$X), Y_prime)
}
head(sort(round(cor_coefs, digits = 3), decreasing = TRUE))
## [1] 0.097 0.089 0.088 0.084 0.083 0.082
(p_val <- sum(abs(cor_coefs) > abs(r_ln)) / length(cor_coefs))
## [1] 0
## 99% CI - bootstrap method on log transformed variables
cor_coefs <- vector("numeric", 10000)
for (i in 1:10000) {
rows <- sample(1:nrow(df), nrow(df), replace = TRUE)
cor_coefs[i] <- cor(log(df[rows, ]$X), log(df[rows, ]$Y))
}
quantile(cor_coefs, c(0.005, 0.995))
## 0.5% 99.5%
## 0.6933705 0.7637877
After performing Box-Cox transformations on both \(X\) and \(Y\) using the values of the parameter \(\lambda\) that maximize the log-likelihood, and also performing simple log transformations on both variables (the 95% confidence interval for \(\lambda\) contains zero for each variable, and \(\lambda = 0\) corresponds to the log transformation), I computed the correlation and the associated 99% confidence interval for each pair of transformed variables. Then, I tested the null hypothesis that the true correlation coefficient \(\rho\) is equal to zero against the alternative hypothesis that \(\rho\) is not equal to zero using a permutation test. Here, new sets of paired values \((x_i, y_{i'})\) were derived from the original pairs \((x_i, y_i)\) by randomly sampling \(y_{i'}\) without replacement from the values \(y_i\), and the correlation of the permuted pairs was calculated. This process was repeated 10,000 times, and the p-value for a two-sided test of the null hypothesis \(\rho = 0\) was calculated as the proportion of the 10,000 permuted correlation coefficients whose absolute value exceeded the absolute value of the correlation coefficient obtained from the original dataset. None of the permuted coefficients did so (the largest positive values were about 0.1, against an observed correlation of about 0.73), so the estimated p-value was zero, that is, below 1/10,000.
I also applied the bootstrap method to approximate the sampling distribution of the sample correlation coefficient and to compute a 99% confidence interval for \(\rho\). Here, I resampled with replacement the same number of paired values as in the original dataset and calculated the correlation coefficient of the resampled data. This process was also iterated 10,000 times, and the 0.5th and 99.5th percentiles of the resulting distribution of resampled correlation coefficients were taken as the interval bounds. The lower bound of the 99% confidence interval was approximately 0.69, well above zero, which supports the conclusion of the permutation test.
In addition, the 99% confidence interval obtained through bootstrap sampling agreed closely with the confidence interval estimated earlier using the Fisher transformation. Very similar results were obtained for the Box-Cox transformed and log-transformed variable pairs: each pair provided strong evidence against the null hypothesis that the true correlation coefficient is zero.
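The three procedures described above can be sketched compactly. This is a minimal illustration on simulated data, not a rerun of the analysis: the data, the seed, and the number of resamples `B` are assumptions. It uses atanh() and tanh(), which are the Fisher transformation and its inverse, and it adds one to the numerator and denominator of the permutation p-value (a common convention, not what the code above does) so that the estimate cannot be exactly zero.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
n <- 200
x <- rnorm(n)
y <- 0.7 * x + rnorm(n, sd = 0.7)    # simulated, positively correlated pair
r_obs <- cor(x, y)

# Permutation test: shuffling y breaks the pairing, so each r is a draw under rho = 0
B <- 2000
r_perm <- replicate(B, cor(x, sample(y)))
# Two-sided p-value; the +1 terms count the observed statistic itself
p_perm <- (sum(abs(r_perm) >= abs(r_obs)) + 1) / (B + 1)

# Fisher interval: transform r, build a normal interval, transform back
z  <- atanh(r_obs)                   # same as 0.5 * log((1 + r) / (1 - r))
se <- 1 / sqrt(n - 3)
ci_fisher <- tanh(z + c(-1, 1) * qnorm(0.995) * se)

# Percentile bootstrap: resample row indices so that pairs stay together
r_boot <- replicate(B, {
  i <- sample.int(n, replace = TRUE)
  cor(x[i], y[i])
})
ci_boot <- quantile(r_boot, c(0.005, 0.995))
```

With the correction, a result in which no permuted coefficient reaches the observed one is reported as 1/(B + 1) rather than 0, which states the resolution limit of the test directly.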
(cor_mat <- cor(data.frame(X_bc = df$X_bc, Y_bc = df$Y_bc)))
## X_bc Y_bc
## X_bc 1.0000000 0.7293698
## Y_bc 0.7293698 1.0000000
(cor_inv <- solve(cor_mat))
## X_bc Y_bc
## X_bc 2.136662 -1.558417
## Y_bc -1.558417 2.136662
cor_mat %*% cor_inv
## X_bc Y_bc
## X_bc 1 0
## Y_bc 0 1
cor_inv %*% cor_mat
## X_bc Y_bc
## X_bc 1 0
## Y_bc 0 1
cor_mat %*% cor_inv == cor_inv %*% cor_mat
## X_bc Y_bc
## X_bc TRUE TRUE
## Y_bc TRUE TRUE
min(df$X) > 0
## [1] TRUE
(nrml_fit <- fitdistr(df$X, densfun = "normal"))
## mean sd
## 1515.46370 525.30039
## ( 13.74774) ( 9.72112)
qqnorm(df$X)
qqline(df$X)
h <- hist(df$X)
rnd <- rnorm(1000, mean = nrml_fit$estimate[1],
sd = nrml_fit$estimate[2])
par(mfrow = c(1, 2))
plot(h)
hist(rnd,
main = paste0("Histogram of 1000 samples", "\n",
"from fitted normal", "\n",
" density function"),
xlab = paste0("Random samples from", "\n", "N(",
round(nrml_fit$estimate[1], digits = 2), ", ",
round(nrml_fit$estimate[2], digits = 2), ")"),
xlim = c(min(c(h$breaks, min(rnd))), max(h$breaks)))
par(mfrow = c(1, 1))
(lognrml_fit <- fitdistr(df$X, densfun = "log-normal"))
## meanlog sdlog
## 7.267774383 0.333436175
## (0.008726424) (0.006170513)
qqnorm(log(df$X))
qqline(log(df$X))
rnd <- exp(rnorm(1000, mean = lognrml_fit$estimate[1],
sd = lognrml_fit$estimate[2]))
par(mfrow = c(1, 2))
plot(h)
hist(rnd,
main = paste0("Histogram of 1000 samples", "\n",
"from fitted log-normal", "\n",
"density function"),
xlab = paste0("Random samples from", "\n", "exp(N(",
round(lognrml_fit$estimate[1], digits = 2), ", ",
round(lognrml_fit$estimate[2], digits = 2), "))"),
xlim = c(min(c(h$breaks, min(rnd))), max(h$breaks)))
par(mfrow = c(1, 1))
Using the fitdistr function from the MASS package, I fit both a normal and, informed by the work above, a log-normal density function to the independent variable \(X\). Comparing the histogram of the original, untransformed variable with histograms of 1000 samples generated from each fitted density function indicates that, while both fitted densities approximate the center of the original distribution well, the log-normal fit does a much better job of capturing the positive (right) skew of the original data.
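The visual comparison could be complemented with a numerical one. The sketch below is a possible approach rather than part of the analysis above: it fits both densities to simulated log-normal data (a stand-in whose parameters are rounded from the fitted values reported above), checks the log-normal estimates against their closed-form maximum likelihood values, and compares the two fits by AIC. The object names are hypothetical.

```r
library(MASS)   # for fitdistr(), as used above

set.seed(42)    # arbitrary seed, for reproducibility only
x <- rlnorm(1000, meanlog = 7.27, sdlog = 0.33)   # simulated stand-in for X

fit_norm  <- fitdistr(x, densfun = "normal")
fit_lnorm <- fitdistr(x, densfun = "log-normal")

# Closed-form MLE for the log-normal: mean and (1/n) standard deviation of log(x)
lx  <- log(x)
mle <- c(meanlog = mean(lx), sdlog = sqrt(mean((lx - mean(lx))^2)))

# Lower AIC indicates the better-supported fit; both likelihoods are on the scale of x
aic <- c(normal = AIC(fit_norm), lognormal = AIC(fit_lnorm))
```

On right-skewed data such as this, the log-normal fit should have the lower AIC, in line with what the histograms suggest.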
train <- read.csv("train.csv", header = TRUE)
test <- read.csv("test.csv", header = TRUE)
str(train)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
## $ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
## $ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
## $ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
## $ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
## $ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
## $ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
## $ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
## $ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
## $ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
## $ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
## $ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
## $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
## $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
## $ GarageType : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
## $ GarageCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
## $ MiscFeature : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
sapply(train, summary)
## $Id
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 365.8 730.5 730.5 1095.0 1460.0
##
## $MSSubClass
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 20.0 50.0 56.9 70.0 190.0
##
## $MSZoning
## C (all) FV RH RL RM
## 10 65 16 1151 218
##
## $LotFrontage
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 21.00 59.00 69.00 70.05 80.00 313.00 259
##
## $LotArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10520 11600 215200
##
## $Street
## Grvl Pave
## 6 1454
##
## $Alley
## Grvl Pave NA's
## 50 41 1369
##
## $LotShape
## IR1 IR2 IR3 Reg
## 484 41 10 925
##
## $LandContour
## Bnk HLS Low Lvl
## 63 50 36 1311
##
## $Utilities
## AllPub NoSeWa
## 1459 1
##
## $LotConfig
## Corner CulDSac FR2 FR3 Inside
## 263 94 47 4 1052
##
## $LandSlope
## Gtl Mod Sev
## 1382 65 13
##
## $Neighborhood
## Blmngtn Blueste BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert
## 17 2 16 58 28 150 51 100 79
## IDOTRR MeadowV Mitchel NAmes NoRidge NPkVill NridgHt NWAmes OldTown
## 37 17 49 225 41 9 77 73 113
## Sawyer SawyerW Somerst StoneBr SWISU Timber Veenker
## 74 59 86 25 25 38 11
##
## $Condition1
## Artery Feedr Norm PosA PosN RRAe RRAn RRNe RRNn
## 48 81 1260 8 19 11 26 2 5
##
## $Condition2
## Artery Feedr Norm PosA PosN RRAe RRAn RRNn
## 2 6 1445 1 2 1 1 2
##
## $BldgType
## 1Fam 2fmCon Duplex Twnhs TwnhsE
## 1220 31 52 43 114
##
## $HouseStyle
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer SLvl
## 154 14 726 8 11 445 37 65
##
## $OverallQual
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 6.099 7.000 10.000
##
## $OverallCond
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 5.000 5.575 6.000 9.000
##
## $YearBuilt
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1872 1954 1973 1971 2000 2010
##
## $YearRemodAdd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1950 1967 1994 1985 2004 2010
##
## $RoofStyle
## Flat Gable Gambrel Hip Mansard Shed
## 13 1141 11 286 7 2
##
## $RoofMatl
## ClyTile CompShg Membran Metal Roll Tar&Grv WdShake WdShngl
## 1 1434 1 1 1 11 5 6
##
## $Exterior1st
## AsbShng AsphShn BrkComm BrkFace CBlock CemntBd HdBoard ImStucc MetalSd
## 20 1 2 50 1 61 222 1 220
## Plywood Stone Stucco VinylSd Wd Sdng WdShing
## 108 2 25 515 206 26
##
## $Exterior2nd
## AsbShng AsphShn Brk Cmn BrkFace CBlock CmentBd HdBoard ImStucc MetalSd
## 20 3 7 25 1 60 207 10 214
## Other Plywood Stone Stucco VinylSd Wd Sdng Wd Shng
## 1 142 5 26 504 197 38
##
## $MasVnrType
## BrkCmn BrkFace None Stone NA's
## 15 445 864 128 8
##
## $MasVnrArea
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 103.7 166.0 1600.0 8
##
## $ExterQual
## Ex Fa Gd TA
## 52 14 488 906
##
## $ExterCond
## Ex Fa Gd Po TA
## 3 28 146 1 1282
##
## $Foundation
## BrkTil CBlock PConc Slab Stone Wood
## 146 634 647 24 6 3
##
## $BsmtQual
## Ex Fa Gd TA NA's
## 121 35 618 649 37
##
## $BsmtCond
## Fa Gd Po TA NA's
## 45 65 2 1311 37
##
## $BsmtExposure
## Av Gd Mn No NA's
## 221 134 114 953 38
##
## $BsmtFinType1
## ALQ BLQ GLQ LwQ Rec Unf NA's
## 220 148 418 74 133 430 37
##
## $BsmtFinSF1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 383.5 443.6 712.2 5644.0
##
## $BsmtFinType2
## ALQ BLQ GLQ LwQ Rec Unf NA's
## 19 33 14 46 54 1256 38
##
## $BsmtFinSF2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 46.55 0.00 1474.00
##
## $BsmtUnfSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 223.0 477.5 567.2 808.0 2336.0
##
## $TotalBsmtSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 795.8 991.5 1057.0 1298.0 6110.0
##
## $Heating
## Floor GasA GasW Grav OthW Wall
## 1 1428 18 7 2 4
##
## $HeatingQC
## Ex Fa Gd Po TA
## 741 49 241 1 428
##
## $CentralAir
## N Y
## 95 1365
##
## $Electrical
## FuseA FuseF FuseP Mix SBrkr NA's
## 94 27 3 1 1334 1
##
## $X1stFlrSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 882 1087 1163 1391 4692
##
## $X2ndFlrSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 347 728 2065
##
## $LowQualFinSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 5.845 0.000 572.000
##
## $GrLivArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
##
## $BsmtFullBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4253 1.0000 3.0000
##
## $BsmtHalfBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05753 0.00000 2.00000
##
## $FullBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.565 2.000 3.000
##
## $HalfBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3829 1.0000 2.0000
##
## $BedroomAbvGr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 2.866 3.000 8.000
##
## $KitchenAbvGr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.047 1.000 3.000
##
## $KitchenQual
## Ex Fa Gd TA
## 100 39 586 735
##
## $TotRmsAbvGrd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 5.000 6.000 6.518 7.000 14.000
##
## $Functional
## Maj1 Maj2 Min1 Min2 Mod Sev Typ
## 14 5 31 34 15 1 1360
##
## $Fireplaces
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 0.613 1.000 3.000
##
## $FireplaceQu
## Ex Fa Gd Po TA NA's
## 24 33 380 20 313 690
##
## $GarageType
## 2Types Attchd Basment BuiltIn CarPort Detchd NA's
## 6 870 19 88 9 387 81
##
## $GarageYrBlt
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1900 1961 1980 1979 2002 2010 81
##
## $GarageFinish
## Fin RFn Unf NA's
## 352 422 605 81
##
## $GarageCars
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.767 2.000 4.000
##
## $GarageArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 334.5 480.0 473.0 576.0 1418.0
##
## $GarageQual
## Ex Fa Gd Po TA NA's
## 3 48 14 3 1311 81
##
## $GarageCond
## Ex Fa Gd Po TA NA's
## 2 35 9 7 1326 81
##
## $PavedDrive
## N P Y
## 90 30 1340
##
## $WoodDeckSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 94.24 168.00 857.00
##
## $OpenPorchSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 25.00 46.66 68.00 547.00
##
## $EnclosedPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 21.95 0.00 552.00
##
## $X3SsnPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 3.41 0.00 508.00
##
## $ScreenPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 15.06 0.00 480.00
##
## $PoolArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.759 0.000 738.000
##
## $PoolQC
## Ex Fa Gd NA's
## 2 2 3 1453
##
## $Fence
## GdPrv GdWo MnPrv MnWw NA's
## 59 54 157 11 1179
##
## $MiscFeature
## Gar2 Othr Shed TenC NA's
## 2 2 49 1 1406
##
## $MiscVal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 43.49 0.00 15500.00
##
## $MoSold
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 6.322 8.000 12.000
##
## $YrSold
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2006 2007 2008 2008 2009 2010
##
## $SaleType
## COD Con ConLD ConLI ConLw CWD New Oth WD
## 43 2 9 5 5 4 122 3 1267
##
## $SaleCondition
## Abnorml AdjLand Alloca Family Normal Partial
## 101 4 12 20 1198 125
##
## $SalePrice
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 130000 163000 180900 214000 755000
sapply(test, summary)
## $Id
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1461 1826 2190 2190 2554 2919
##
## $MSSubClass
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 20.00 50.00 57.38 70.00 190.00
##
## $MSZoning
## C (all) FV RH RL RM NA's
## 15 74 10 1114 242 4
##
## $LotFrontage
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 21.00 58.00 67.00 68.58 80.00 200.00 227
##
## $LotArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1470 7391 9399 9819 11520 56600
##
## $Street
## Grvl Pave
## 6 1453
##
## $Alley
## Grvl Pave NA's
## 70 37 1352
##
## $LotShape
## IR1 IR2 IR3 Reg
## 484 35 6 934
##
## $LandContour
## Bnk HLS Low Lvl
## 54 70 24 1311
##
## $Utilities
## AllPub NA's
## 1457 2
##
## $LotConfig
## Corner CulDSac FR2 FR3 Inside
## 248 82 38 10 1081
##
## $LandSlope
## Gtl Mod Sev
## 1396 60 3
##
## $Neighborhood
## Blmngtn Blueste BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert
## 11 8 14 50 16 117 52 94 86
## IDOTRR MeadowV Mitchel NAmes NoRidge NPkVill NridgHt NWAmes OldTown
## 56 20 65 218 30 14 89 58 126
## Sawyer SawyerW Somerst StoneBr SWISU Timber Veenker
## 77 66 96 26 23 34 13
##
## $Condition1
## Artery Feedr Norm PosA PosN RRAe RRAn RRNe RRNn
## 44 83 1251 12 20 17 24 4 4
##
## $Condition2
## Artery Feedr Norm PosA PosN
## 3 7 1444 3 2
##
## $BldgType
## 1Fam 2fmCon Duplex Twnhs TwnhsE
## 1205 31 57 53 113
##
## $HouseStyle
## 1.5Fin 1.5Unf 1Story 2.5Unf 2Story SFoyer SLvl
## 160 5 745 13 427 46 63
##
## $OverallQual
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 6.079 7.000 10.000
##
## $OverallCond
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 5.000 5.554 6.000 9.000
##
## $YearBuilt
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1879 1953 1973 1971 2001 2010
##
## $YearRemodAdd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1950 1963 1992 1984 2004 2010
##
## $RoofStyle
## Flat Gable Gambrel Hip Mansard Shed
## 7 1169 11 265 4 3
##
## $RoofMatl
## CompShg Tar&Grv WdShake WdShngl
## 1442 12 4 1
##
## $Exterior1st
## AsbShng AsphShn BrkComm BrkFace CBlock CemntBd HdBoard MetalSd Plywood
## 24 1 4 37 1 65 220 230 113
## Stucco VinylSd Wd Sdng WdShing NA's
## 18 510 205 30 1
##
## $Exterior2nd
## AsbShng AsphShn Brk Cmn BrkFace CBlock CmentBd HdBoard ImStucc MetalSd
## 18 1 15 22 2 66 199 5 233
## Plywood Stone Stucco VinylSd Wd Sdng Wd Shng NA's
## 128 1 21 510 194 43 1
##
## $MasVnrType
## BrkCmn BrkFace None Stone NA's
## 10 434 878 121 16
##
## $MasVnrArea
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 100.7 164.0 1290.0 15
##
## $ExterQual
## Ex Fa Gd TA
## 55 21 491 892
##
## $ExterCond
## Ex Fa Gd Po TA
## 9 39 153 2 1256
##
## $Foundation
## BrkTil CBlock PConc Slab Stone Wood
## 165 601 661 25 5 2
##
## $BsmtQual
## Ex Fa Gd TA NA's
## 137 53 591 634 44
##
## $BsmtCond
## Fa Gd Po TA NA's
## 59 57 3 1295 45
##
## $BsmtExposure
## Av Gd Mn No NA's
## 197 142 125 951 44
##
## $BsmtFinType1
## ALQ BLQ GLQ LwQ Rec Unf NA's
## 209 121 431 80 155 421 42
##
## $BsmtFinSF1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 350.5 439.2 753.5 4010.0 1
##
## $BsmtFinType2
## ALQ BLQ GLQ LwQ Rec Unf NA's
## 33 35 20 41 51 1237 42
##
## $BsmtFinSF2
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.00 52.62 0.00 1526.00 1
##
## $BsmtUnfSF
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 219.2 460.0 554.3 797.8 2140.0 1
##
## $TotalBsmtSF
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 784 988 1046 1305 5095 1
##
## $Heating
## GasA GasW Grav Wall
## 1446 9 2 2
##
## $HeatingQC
## Ex Fa Gd Po TA
## 752 43 233 2 429
##
## $CentralAir
## N Y
## 101 1358
##
## $Electrical
## FuseA FuseF FuseP SBrkr
## 94 23 5 1337
##
## $X1stFlrSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 407.0 873.5 1079.0 1157.0 1382.0 5095.0
##
## $X2ndFlrSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 326 676 1862
##
## $LowQualFinSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 3.544 0.000 1064.000
##
## $GrLivArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 407 1118 1432 1486 1721 5095
##
## $BsmtFullBath
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.4345 1.0000 3.0000 2
##
## $BsmtHalfBath
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.0652 0.0000 2.0000 2
##
## $FullBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.571 2.000 4.000
##
## $HalfBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3777 1.0000 2.0000
##
## $BedroomAbvGr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 2.854 3.000 6.000
##
## $KitchenAbvGr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.042 1.000 2.000
##
## $KitchenQual
## Ex Fa Gd TA NA's
## 105 31 565 757 1
##
## $TotRmsAbvGrd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 6.385 7.000 15.000
##
## $Functional
## Maj1 Maj2 Min1 Min2 Mod Sev Typ NA's
## 5 4 34 36 20 1 1357 2
##
## $Fireplaces
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5812 1.0000 4.0000
##
## $FireplaceQu
## Ex Fa Gd Po TA NA's
## 19 41 364 26 279 730
##
## $GarageType
## 2Types Attchd Basment BuiltIn CarPort Detchd NA's
## 17 853 17 98 6 392 76
##
## $GarageYrBlt
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1895 1959 1979 1978 2002 2207 78
##
## $GarageFinish
## Fin RFn Unf NA's
## 367 389 625 78
##
## $GarageCars
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 1.000 2.000 1.766 2.000 5.000 1
##
## $GarageArea
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 318.0 480.0 472.8 576.0 1488.0 1
##
## $GarageQual
## Fa Gd Po TA NA's
## 76 10 2 1293 78
##
## $GarageCond
## Ex Fa Gd Po TA NA's
## 1 39 6 7 1328 78
##
## $PavedDrive
## N P Y
## 126 32 1301
##
## $WoodDeckSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 93.17 168.00 1424.00
##
## $OpenPorchSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 28.00 48.31 72.00 742.00
##
## $EnclosedPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 24.24 0.00 1012.00
##
## $X3SsnPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.794 0.000 360.000
##
## $ScreenPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 17.06 0.00 576.00
##
## $PoolArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.744 0.000 800.000
##
## $PoolQC
## Ex Gd NA's
## 2 1 1456
##
## $Fence
## GdPrv GdWo MnPrv MnWw NA's
## 59 58 172 1 1169
##
## $MiscFeature
## Gar2 Othr Shed NA's
## 3 2 46 1408
##
## $MiscVal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 58.17 0.00 17000.00
##
## $MoSold
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 6.000 6.104 8.000 12.000
##
## $YrSold
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2006 2007 2008 2008 2009 2010
##
## $SaleType
## COD Con ConLD ConLI ConLw CWD New Oth WD NA's
## 44 3 17 4 3 8 117 4 1258 1
##
## $SaleCondition
## Abnorml AdjLand Alloca Family Normal Partial
## 89 8 12 26 1204 120
## count missing values in each variable in `train` and `test`
colSums(sapply(train, is.na))[colSums(sapply(train, is.na)) > 0]
## LotFrontage Alley MasVnrType MasVnrArea BsmtQual
## 259 1369 8 8 37
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 Electrical
## 37 38 37 38 1
## FireplaceQu GarageType GarageYrBlt GarageFinish GarageQual
## 690 81 81 81 81
## GarageCond PoolQC Fence MiscFeature
## 81 1453 1179 1406
colSums(sapply(test, is.na))[colSums(sapply(test, is.na)) > 0]
## MSZoning LotFrontage Alley Utilities Exterior1st
## 4 227 1352 2 1
## Exterior2nd MasVnrType MasVnrArea BsmtQual BsmtCond
## 1 16 15 44 45
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## 44 42 1 42 1
## BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual
## 1 1 2 2 1
## Functional FireplaceQu GarageType GarageYrBlt GarageFinish
## 2 730 76 78 78
## GarageCars GarageArea GarageQual GarageCond PoolQC
## 1 1 78 78 1456
## Fence MiscFeature SaleType
## 1169 1408 1
## check for duplicates
nrow(train) - nrow(unique(train))
## [1] 0
nrow(test) - nrow(unique(test))
## [1] 0
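One caveat on the duplicate check above: every row carries its own `Id`, so comparing whole rows reports zero duplicates even when two records are otherwise identical. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the data set) shows the difference when `Id` is left out of the comparison:

```r
# Toy data frame standing in for `train`; the values are invented for illustration.
toy <- data.frame(Id = 1:4,
                  LotArea = c(8450, 9600, 8450, 11250),
                  SalePrice = c(208500, 181500, 208500, 250000))

# Whole-row comparison finds nothing, because Id differs on every row.
n_dup_all <- nrow(toy) - nrow(unique(toy))

# Dropping Id before the comparison exposes the repeated record (row 3 repeats row 1).
n_dup_no_id <- sum(duplicated(toy[, names(toy) != "Id"]))

print(c(all_columns = n_dup_all, without_id = n_dup_no_id))
```

The same `duplicated()` call could be pointed at `train` and `test` if a stricter check is wanted.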
par(mfrow = c(2, 4))
for (row in 1:10) {
  for (i in 1:4) {
    j <- (row - 1) * 8 + i + 1
    if (j == 80) {break}
    # density plot for numeric columns with at least 12 distinct values,
    # bar plot of level proportions otherwise
    if (is.numeric(train[[j]]) &&
        length(unique(train[[j]])) >= 12) {
      plot(density(train[[j]], na.rm = TRUE),
           main = colnames(train)[j])
    } else {
      barplot(prop.table(table(train[[j]])),
              main = colnames(train)[j])
    }
  }
}
par(mfrow = c(2, 4))
for (row in 1:10) {
  for (i in 5:8) {
    j <- (row - 1) * 8 + i + 1
    if (j == 80) {break}
    # density plot for numeric columns with at least 12 distinct values,
    # bar plot of level proportions otherwise
    if (is.numeric(train[[j]]) &&
        length(unique(train[[j]])) >= 12) {
      plot(density(train[[j]], na.rm = TRUE),
           main = colnames(train)[j])
    } else {
      barplot(prop.table(table(train[[j]])),
              main = colnames(train)[j])
    }
  }
}
par(mfrow = c(2, 4))
for (row in 1:10) {
  for (i in 1:8) {
    j <- (row - 1) * 8 + i + 1
    if (j == 80) {break}
    plot(train[[j]], train$SalePrice, main = colnames(train)[j])
  }
}
par(mfrow = c(1, 1))
library(corrplot)
cors <-
cor(train[sapply(train, is.numeric) &
sapply(train,
function(x) length(unique(x)) >= 5)][, -1],
use = "na.or.complete")
corrplot(cors, method = "square")
cors[, "SalePrice"]
## MSSubClass LotFrontage LotArea OverallQual OverallCond
## -0.088031702 0.344269772 0.299962206 0.797880680 -0.124391232
## YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2
## 0.525393598 0.521253270 0.488658155 0.390300523 -0.028021366
## BsmtUnfSF TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF
## 0.213128680 0.615612237 0.607969106 0.306879002 -0.001481983
## GrLivArea BedroomAbvGr TotRmsAbvGrd GarageYrBlt GarageCars
## 0.705153567 0.166813894 0.547067360 0.504753018 0.647033611
## GarageArea WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0.619329622 0.336855121 0.343353812 -0.154843204 0.030776594
## ScreenPorch PoolArea MiscVal MoSold YrSold
## 0.110426815 0.092488120 -0.036041237 0.051568064 -0.011868823
## SalePrice
## 1.000000000
kv_class <-
data.frame(key = c(20, 30, 40, 45, 50, 60, 70, 75,
80, 85, 90, 120, 150, 160, 180, 190),
value = c("1StoryNew", "1StoryOld",
"1StoryAttic", "1.5StoryUnf",
"1.5StoryFin", "2StoryNew",
"2StoryOld", "2.5Story",
"SplitLevel", "SplitFoyer",
"Duplex", "1StoryPUD",
"1.5StoryPUD", "2StoryPUD",
"MultiLevelPUD", "TwoFamConvert")
)
replace_missing <- function(dataset) {
df <- dataset
i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)
df$MSSubClass <-
sapply(df$MSSubClass,
function(x) kv_class[kv_class$key == x, ]$value)
df$MSZoning[is.na(df$MSZoning)] <- "RL"
df$LotFrontage[is.na(df$LotFrontage)] <- median(df$LotFrontage, na.rm = TRUE)
df$Alley[is.na(df$Alley)] <- "None"
df$Utilities[is.na(df$Utilities)] <- "AllPub"
df$Exterior1st[is.na(df$Exterior1st)] <- "VinylSd"
df$Exterior2nd[is.na(df$Exterior2nd)] <- "VinylSd"
df$MasVnrType[is.na(df$MasVnrType)] <- "None"
df$MasVnrArea[is.na(df$MasVnrArea)] <- 0
df$BsmtQual[is.na(df$BsmtQual)] <- "None"
df$BsmtCond[is.na(df$BsmtCond)] <- "None"
df$BsmtExposure[is.na(df$BsmtExposure)] <- "None"
df$BsmtFinType1[is.na(df$BsmtFinType1)] <- "None"
df$BsmtFinSF1[is.na(df$BsmtFinSF1)] <- 0
df$BsmtFinType2[is.na(df$BsmtFinType2)] <- "None"
df$BsmtFinSF2[is.na(df$BsmtFinSF2)] <- 0
df$BsmtUnfSF[is.na(df$BsmtUnfSF)] <- 0
df$TotalBsmtSF[is.na(df$TotalBsmtSF)] <- 0
df$Electrical[is.na(df$Electrical)] <- "SBrkr"
df$BsmtFullBath[is.na(df$BsmtFullBath)] <- 0
df$BsmtHalfBath[is.na(df$BsmtHalfBath)] <- 0
df$KitchenQual[is.na(df$KitchenQual)] <- "TA"
df$Functional[is.na(df$Functional)] <- "Typ"
df$FireplaceQu[is.na(df$FireplaceQu)] <- "None"
df$GarageType[is.na(df$GarageType)] <- "None"
df$GarageYrBlt[is.na(df$GarageYrBlt)] <- min(df$GarageYrBlt, na.rm = TRUE)
df$GarageFinish[is.na(df$GarageFinish)] <- "None"
df$GarageCars[is.na(df$GarageCars)] <- 0
df$GarageArea[is.na(df$GarageArea)] <- 0
df$GarageQual[is.na(df$GarageQual)] <- "None"
df$GarageCond[is.na(df$GarageCond)] <- "None"
df$PoolQC[is.na(df$PoolQC)] <- "None"
df$Fence[is.na(df$Fence)] <- "None"
df$MiscFeature[is.na(df$MiscFeature)] <- "None"
df$SaleType[is.na(df$SaleType)] <- "WD"
i <- sapply(df, is.character)
df[i] <- lapply(df[i], as.factor)
return(df)
}
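Several of the categorical fills in `replace_missing` (for example `"TA"` for KitchenQual, `"Typ"` for Functional and `"WD"` for SaleType) coincide with the most frequent level in the summaries above. Assuming that is the intent, a small helper could derive the fill value from the data instead of hard-coding it. `impute_mode` is a hypothetical name introduced here, and the `zoning` vector is made up:

```r
# Hypothetical helper: replace NA with the most frequent non-missing value.
impute_mode <- function(x) {
  counts <- table(x)                               # table() ignores NA by default
  x[is.na(x)] <- names(counts)[which.max(counts)]  # first level with the top count
  x
}

# Made-up character vector standing in for a column such as MSZoning.
zoning <- c("RL", "RM", NA, "RL", "FV", NA)
filled <- impute_mode(zoning)
print(filled)
```

This only makes sense for columns where a missing value means "unknown"; columns where NA means "feature absent" (Alley, PoolQC, Fence and similar) still need the explicit `"None"` fill.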
kv_bldg_type <-
data.frame(key = c("2fmCon", "Duplex", "Twnhs", "TwnhsE", "1Fam"),
value = 1:5
)
kv_ext_qual <-
data.frame(key = c("Po", "Fa", "TA", "Gd", "Ex"),
value = 1:5)
kv_ext_cond <-
data.frame(key = c("Po", "Fa", "TA", "Gd", "Ex"),
value = 1:5)
kv_bsmt_qual <-
data.frame(key = c("None", "Po", "Fa", "TA", "Gd", "Ex"),
value = 0:5)
kv_bsmt_cond <-
data.frame(key = c("Po", "None", "Fa", "TA", "Gd", "Ex"),
value = 0:5)
kv_bsmt_exp <-
data.frame(key = c("None", "No", "Mn", "Av", "Gd"),
value = 0:4)
kv_heat_qc <-
data.frame(key = c("Po", "Fa", "TA", "Gd", "Ex"),
value = 1:5)
kv_electrical <-
data.frame(key = c("Mix", "FuseP", "FuseF", "FuseA", "SBrkr"),
value = 1:5)
kv_kitchen <-
data.frame(key = c("Po", "Fa", "TA", "Gd", "Ex"),
value = 1:5)
kv_fireplace_q <-
data.frame(key = c("Po", "None", "Fa", "TA", "Gd", "Ex"),
value = 0:5)
kv_garage_fin <-
data.frame(key = c("None", "Unf", "RFn", "Fin"),
value = 0:3)
kv_paved_drive <-
data.frame(key = c("N", "P", "Y"), value = 1:3)
recode <- function(dataset) {
# categorical
df <- dataset
i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)
df$BldgType <-
sapply(df$BldgType,
function(x) kv_bldg_type[kv_bldg_type$key == x, ]$value)
df$ExterQual <-
sapply(df$ExterQual,
function(x) kv_ext_qual[kv_ext_qual$key == x, ]$value)
df$ExterCond <-
sapply(df$ExterCond,
function(x) kv_ext_cond[kv_ext_cond$key == x, ]$value)
df$BsmtQual <-
sapply(df$BsmtQual,
function(x) kv_bsmt_qual[kv_bsmt_qual$key == x, ]$value)
df$BsmtCond <-
sapply(df$BsmtCond,
function(x) kv_bsmt_cond[kv_bsmt_cond$key == x, ]$value)
df$BsmtExposure <-
sapply(df$BsmtExposure,
function(x) kv_bsmt_exp[kv_bsmt_exp$key == x, ]$value)
df$BsmtFinType1 <-
ifelse(df$BsmtFinType1 == "GLQ", 2,
ifelse(df$BsmtFinType1 == "None", 0, 1))
df$HeatingQC <-
sapply(df$HeatingQC,
function(x) kv_heat_qc[kv_heat_qc$key == x, ]$value)
df$CentralAir <- ifelse(df$CentralAir == "Y", 1, 0)
df$Electrical <-
sapply(df$Electrical,
function(x) kv_electrical[kv_electrical$key == x, ]$value)
df$KitchenQual <-
sapply(df$KitchenQual,
function(x) kv_kitchen[kv_kitchen$key == x, ]$value)
df$FireplaceQu <-
sapply(df$FireplaceQu,
function(x) kv_fireplace_q[kv_fireplace_q$key == x, ]$value)
df$GarageType <-
ifelse(df$GarageType %in%
c("2Types", "Attchd", "Basment", "BuiltIn"), 1, 0)
df$GarageFinish <-
sapply(df$GarageFinish,
function(x) kv_garage_fin[kv_garage_fin$key == x, ]$value)
df$GarageQual <-
ifelse(df$GarageQual %in% c("Ex", "Gd", "TA"), 1, 0)
df$GarageCond <-
ifelse(df$GarageCond %in% c("Ex", "Gd", "TA"), 1, 0)
df$PavedDrive <-
sapply(df$PavedDrive,
function(x) kv_paved_drive[kv_paved_drive$key == x, ]$value)
df$PoolQC <- ifelse(df$PoolQC == "Ex", 1, 0)
df$MoSold <- sapply(df$MoSold, function(x) month.name[x])
i <- sapply(df, is.character)
df[i] <- lapply(df[i], as.factor)
# binary coding
df$MasVnrArea <- ifelse(df$MasVnrArea > 0, 1, 0)
df$MiscVal <- ifelse(df$MiscVal > 0, 1, 0)
df$X3SsnPorch <- ifelse(df$X3SsnPorch > 0, 1, 0)
df$ScreenPorch <- ifelse(df$ScreenPorch > 0, 1, 0)
df$LowQualFinSF <- ifelse(df$LowQualFinSF > 0, 0, 1)
## log transform
df$LotArea <- log(df$LotArea)
df$GrLivArea <- log(df$GrLivArea)
return(df)
}
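The key/value recoding in `recode` filters the lookup table once per element inside `sapply`. A vectorised alternative, sketched here under the assumption that each key appears once in its table, uses `match()`. `lookup` and `kv_demo` are hypothetical names for illustration only:

```r
# Illustrative lookup table with the same shape as the kv_* tables above.
kv_demo <- data.frame(key = c("Po", "Fa", "TA", "Gd", "Ex"),
                      value = 1:5,
                      stringsAsFactors = FALSE)

# Vectorised lookup: match() returns the row position of each key, or NA if absent.
lookup <- function(x, kv) kv$value[match(x, kv$key)]

# "Zz" is a deliberately unknown level to show how a miss is reported.
codes <- lookup(c("TA", "Ex", "Gd", "Zz"), kv_demo)
print(codes)
```

An unmatched level comes back as `NA`, so a miss can be caught with `anyNA()`. With the `sapply` form, an unmatched key produces a zero-length result and `sapply` falls back to returning a list.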
drop_outliers <- function(dataset) {
df <- dataset
df <- df[df$BsmtFinSF1 < 5000, ]
df <- df[df$X1stFlrSF < 4000, ]
return(df)
}
library(caret)
## Warning: package 'caret' was built under R version 3.3.2
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:psychometric':
##
## alpha
train <- replace_missing(train)
test <- replace_missing(test)
train_facts <- sapply(train[colnames(train[sapply(train, is.factor)])], function(x) sort(unique(x[!is.na(x)])))
test_facts <- sapply(test[colnames(train[sapply(train, is.factor)])], function(x) sort(unique(x[!is.na(x)])))
for (i in seq_along(test_facts)) {
  # report factors that take values in `test` never seen in `train`
  if (length(setdiff(test_facts[[i]], train_facts[[i]])) > 0) {
    print(names(test_facts)[i])
  }
}
## [1] "MSSubClass"
unique(train$MSSubClass)
## [1] 2StoryNew 1StoryNew 2StoryOld 1.5StoryFin TwoFamConvert
## [6] 1.5StoryUnf Duplex 1StoryPUD 1StoryOld SplitFoyer
## [11] SplitLevel 2StoryPUD 2.5Story MultiLevelPUD 1StoryAttic
## 16 Levels: 1.5StoryFin 1.5StoryPUD 1.5StoryUnf 1StoryAttic ... TwoFamConvert
unique(test$MSSubClass)
## [1] 1StoryNew 2StoryNew 1StoryPUD 2StoryPUD SplitLevel
## [6] 1StoryOld 1.5StoryFin Duplex SplitFoyer TwoFamConvert
## [11] 1.5StoryUnf 2StoryOld 2.5Story MultiLevelPUD 1StoryAttic
## [16] 1.5StoryPUD
## 16 Levels: 1.5StoryFin 1.5StoryPUD 1.5StoryUnf 1StoryAttic ... TwoFamConvert
train <- recode(train)
test <- recode(test)
train <- drop_outliers(train)
colSums(sapply(train, is.na))[colSums(sapply(train, is.na)) > 0]
## named numeric(0)
colSums(sapply(test, is.na))[colSums(sapply(test, is.na)) > 0]
## named numeric(0)
## dummy code categorical variables in `train` and `test` datasets
dummies <- dummyVars("~ .", data = rbind(train[, -ncol(train)], test))
SalePrice <- data.frame(Id = train$Id, SalePrice = train$SalePrice)
train <- as.data.frame(predict(dummies, newdata = train))
train <- merge(train, SalePrice, by = "Id")
test <- as.data.frame(predict(dummies, newdata = test))
(nzv <- nearZeroVar(train, saveMetrics = TRUE))
## freqRatio percentUnique zeroVar nzv
## Id 1.000000 100.0000000 FALSE FALSE
## MSSubClass.1.5StoryFin 9.131944 0.1370802 FALSE FALSE
## MSSubClass.1.5StoryUnf 120.583333 0.1370802 FALSE TRUE
## MSSubClass.1StoryAttic 363.750000 0.1370802 FALSE TRUE
## MSSubClass.1StoryNew 1.722015 0.1370802 FALSE FALSE
## MSSubClass.1StoryOld 20.144928 0.1370802 FALSE TRUE
## MSSubClass.1StoryPUD 15.770115 0.1370802 FALSE FALSE
## MSSubClass.2.5Story 90.187500 0.1370802 FALSE TRUE
## MSSubClass.2StoryNew 3.895973 0.1370802 FALSE FALSE
## MSSubClass.2StoryOld 23.316667 0.1370802 FALSE TRUE
## MSSubClass.2StoryPUD 22.158730 0.1370802 FALSE TRUE
## MSSubClass.Duplex 27.057692 0.1370802 FALSE TRUE
## MSSubClass.MultiLevelPUD 144.900000 0.1370802 FALSE TRUE
## MSSubClass.SplitFoyer 71.950000 0.1370802 FALSE TRUE
## MSSubClass.SplitLevel 24.155172 0.1370802 FALSE TRUE
## MSSubClass.TwoFamConvert 47.633333 0.1370802 FALSE TRUE
## MSSubClass.1.5StoryPUD 0.000000 0.0685401 TRUE TRUE
## MSZoning.C (all) 144.900000 0.1370802 FALSE TRUE
## MSZoning.FV 21.446154 0.1370802 FALSE TRUE
## MSZoning.RH 90.187500 0.1370802 FALSE TRUE
## MSZoning.RL 3.721683 0.1370802 FALSE FALSE
## MSZoning.RM 5.692661 0.1370802 FALSE FALSE
## LotFrontage 1.888112 7.5394106 FALSE FALSE
## LotArea 1.041667 73.4749829 FALSE FALSE
## Street.Grvl 242.166667 0.1370802 FALSE TRUE
## Street.Pave 242.166667 0.1370802 FALSE TRUE
## Alley.Grvl 28.180000 0.1370802 FALSE TRUE
## Alley.None 15.032967 0.1370802 FALSE FALSE
## Alley.Pave 34.585366 0.1370802 FALSE TRUE
## LotShape.IR1 2.014463 0.1370802 FALSE FALSE
## LotShape.IR2 34.585366 0.1370802 FALSE TRUE
## LotShape.IR3 161.111111 0.1370802 FALSE TRUE
## LotShape.Reg 1.732210 0.1370802 FALSE FALSE
## LandContour.Bnk 22.532258 0.1370802 FALSE TRUE
## LandContour.HLS 28.180000 0.1370802 FALSE TRUE
## LandContour.Low 39.527778 0.1370802 FALSE TRUE
## LandContour.Lvl 8.858108 0.1370802 FALSE FALSE
## Utilities.AllPub 1458.000000 0.1370802 FALSE TRUE
## Utilities.NoSeWa 1458.000000 0.1370802 FALSE TRUE
## LotConfig.Corner 4.568702 0.1370802 FALSE FALSE
## LotConfig.CulDSac 14.521277 0.1370802 FALSE FALSE
## LotConfig.FR2 30.042553 0.1370802 FALSE TRUE
## LotConfig.FR3 363.750000 0.1370802 FALSE TRUE
## LotConfig.Inside 2.584767 0.1370802 FALSE FALSE
## LandSlope.Gtl 17.705128 0.1370802 FALSE FALSE
## LandSlope.Mod 21.446154 0.1370802 FALSE TRUE
## LandSlope.Sev 111.230769 0.1370802 FALSE TRUE
## Neighborhood.Blmngtn 84.823529 0.1370802 FALSE TRUE
## Neighborhood.Blueste 728.500000 0.1370802 FALSE TRUE
## Neighborhood.BrDale 90.187500 0.1370802 FALSE TRUE
## Neighborhood.BrkSide 24.155172 0.1370802 FALSE TRUE
## Neighborhood.ClearCr 51.107143 0.1370802 FALSE TRUE
## Neighborhood.CollgCr 8.726667 0.1370802 FALSE FALSE
## Neighborhood.Crawfor 27.607843 0.1370802 FALSE TRUE
## Neighborhood.Edwards 13.737374 0.1370802 FALSE FALSE
## Neighborhood.Gilbert 17.468354 0.1370802 FALSE FALSE
## Neighborhood.IDOTRR 38.432432 0.1370802 FALSE TRUE
## Neighborhood.MeadowV 84.823529 0.1370802 FALSE TRUE
## Neighborhood.Mitchel 28.775510 0.1370802 FALSE TRUE
## Neighborhood.NAmes 5.484444 0.1370802 FALSE FALSE
## Neighborhood.NoRidge 34.585366 0.1370802 FALSE TRUE
## Neighborhood.NPkVill 161.111111 0.1370802 FALSE TRUE
## Neighborhood.NridgHt 17.948052 0.1370802 FALSE FALSE
## Neighborhood.NWAmes 18.986301 0.1370802 FALSE FALSE
## Neighborhood.OldTown 11.911504 0.1370802 FALSE FALSE
## Neighborhood.Sawyer 18.716216 0.1370802 FALSE FALSE
## Neighborhood.SawyerW 23.728814 0.1370802 FALSE TRUE
## Neighborhood.Somerst 15.965116 0.1370802 FALSE FALSE
## Neighborhood.StoneBr 57.360000 0.1370802 FALSE TRUE
## Neighborhood.SWISU 57.360000 0.1370802 FALSE TRUE
## Neighborhood.Timber 37.394737 0.1370802 FALSE TRUE
## Neighborhood.Veenker 131.636364 0.1370802 FALSE TRUE
## Condition1.Artery 29.395833 0.1370802 FALSE TRUE
## Condition1.Feedr 17.237500 0.1370802 FALSE FALSE
## Condition1.Norm 6.331658 0.1370802 FALSE FALSE
## Condition1.PosA 181.375000 0.1370802 FALSE TRUE
## Condition1.PosN 75.789474 0.1370802 FALSE TRUE
## Condition1.RRAe 131.636364 0.1370802 FALSE TRUE
## Condition1.RRAn 55.115385 0.1370802 FALSE TRUE
## Condition1.RRNe 728.500000 0.1370802 FALSE TRUE
## Condition1.RRNn 290.800000 0.1370802 FALSE TRUE
## Condition2.Artery 728.500000 0.1370802 FALSE TRUE
## Condition2.Feedr 242.166667 0.1370802 FALSE TRUE
## Condition2.Norm 96.266667 0.1370802 FALSE TRUE
## Condition2.PosA 1458.000000 0.1370802 FALSE TRUE
## Condition2.PosN 728.500000 0.1370802 FALSE TRUE
## Condition2.RRAe 1458.000000 0.1370802 FALSE TRUE
## Condition2.RRAn 1458.000000 0.1370802 FALSE TRUE
## Condition2.RRNn 728.500000 0.1370802 FALSE TRUE
## BldgType 10.692982 0.3427005 FALSE FALSE
## HouseStyle.1.5Fin 8.474026 0.1370802 FALSE FALSE
## HouseStyle.1.5Unf 103.214286 0.1370802 FALSE TRUE
## HouseStyle.1Story 1.009642 0.1370802 FALSE FALSE
## HouseStyle.2.5Fin 181.375000 0.1370802 FALSE TRUE
## HouseStyle.2.5Unf 131.636364 0.1370802 FALSE TRUE
## HouseStyle.2Story 2.286036 0.1370802 FALSE FALSE
## HouseStyle.SFoyer 38.432432 0.1370802 FALSE TRUE
## HouseStyle.SLvl 21.446154 0.1370802 FALSE TRUE
## OverallQual 1.061497 0.6854010 FALSE FALSE
## OverallCond 3.253968 0.6168609 FALSE FALSE
## YearBuilt 1.046875 7.6764907 FALSE FALSE
## YearRemodAdd 1.835052 4.1809459 FALSE FALSE
## RoofStyle.Flat 111.230769 0.1370802 FALSE TRUE
## RoofStyle.Gable 3.588050 0.1370802 FALSE FALSE
## RoofStyle.Gambrel 131.636364 0.1370802 FALSE TRUE
## RoofStyle.Hip 4.119298 0.1370802 FALSE FALSE
## RoofStyle.Mansard 207.428571 0.1370802 FALSE TRUE
## RoofStyle.Shed 728.500000 0.1370802 FALSE TRUE
## RoofMatl.ClyTile 0.000000 0.0685401 TRUE TRUE
## RoofMatl.CompShg 57.360000 0.1370802 FALSE TRUE
## RoofMatl.Membran 1458.000000 0.1370802 FALSE TRUE
## RoofMatl.Metal 1458.000000 0.1370802 FALSE TRUE
## RoofMatl.Roll 1458.000000 0.1370802 FALSE TRUE
## RoofMatl.Tar&Grv 131.636364 0.1370802 FALSE TRUE
## RoofMatl.WdShake 290.800000 0.1370802 FALSE TRUE
## RoofMatl.WdShngl 242.166667 0.1370802 FALSE TRUE
## Exterior1st.AsbShng 71.950000 0.1370802 FALSE TRUE
## Exterior1st.AsphShn 1458.000000 0.1370802 FALSE TRUE
## Exterior1st.BrkComm 728.500000 0.1370802 FALSE TRUE
## Exterior1st.BrkFace 28.180000 0.1370802 FALSE TRUE
## Exterior1st.CBlock 1458.000000 0.1370802 FALSE TRUE
## Exterior1st.CemntBd 22.918033 0.1370802 FALSE TRUE
## Exterior1st.HdBoard 5.572072 0.1370802 FALSE FALSE
## Exterior1st.ImStucc 1458.000000 0.1370802 FALSE TRUE
## Exterior1st.MetalSd 5.631818 0.1370802 FALSE FALSE
## Exterior1st.Plywood 12.509259 0.1370802 FALSE FALSE
## Exterior1st.Stone 728.500000 0.1370802 FALSE TRUE
## Exterior1st.Stucco 59.791667 0.1370802 FALSE TRUE
## Exterior1st.VinylSd 1.833010 0.1370802 FALSE FALSE
## Exterior1st.Wd Sdng 6.082524 0.1370802 FALSE FALSE
## Exterior1st.WdShing 55.115385 0.1370802 FALSE TRUE
## Exterior2nd.AsbShng 71.950000 0.1370802 FALSE TRUE
## Exterior2nd.AsphShn 485.333333 0.1370802 FALSE TRUE
## Exterior2nd.Brk Cmn 207.428571 0.1370802 FALSE TRUE
## Exterior2nd.BrkFace 57.360000 0.1370802 FALSE TRUE
## Exterior2nd.CBlock 1458.000000 0.1370802 FALSE TRUE
## Exterior2nd.CmentBd 23.316667 0.1370802 FALSE TRUE
## Exterior2nd.HdBoard 6.048309 0.1370802 FALSE FALSE
## Exterior2nd.ImStucc 144.900000 0.1370802 FALSE TRUE
## Exterior2nd.MetalSd 5.817757 0.1370802 FALSE FALSE
## Exterior2nd.Other 1458.000000 0.1370802 FALSE TRUE
## Exterior2nd.Plywood 9.274648 0.1370802 FALSE FALSE
## Exterior2nd.Stone 290.800000 0.1370802 FALSE TRUE
## Exterior2nd.Stucco 57.360000 0.1370802 FALSE TRUE
## Exterior2nd.VinylSd 1.894841 0.1370802 FALSE FALSE
## Exterior2nd.Wd Sdng 6.406091 0.1370802 FALSE FALSE
## Exterior2nd.Wd Shng 37.394737 0.1370802 FALSE TRUE
## MasVnrType.BrkCmn 96.266667 0.1370802 FALSE TRUE
## MasVnrType.BrkFace 2.278652 0.1370802 FALSE FALSE
## MasVnrType.None 1.485520 0.1370802 FALSE FALSE
## MasVnrType.Stone 10.488189 0.1370802 FALSE FALSE
## MasVnrArea 1.472881 0.1370802 FALSE FALSE
## ExterQual 1.856557 0.2741604 FALSE FALSE
## ExterCond 8.773973 0.3427005 FALSE FALSE
## Foundation.BrkTil 8.993151 0.1370802 FALSE FALSE
## Foundation.CBlock 1.301262 0.1370802 FALSE FALSE
## Foundation.PConc 1.258514 0.1370802 FALSE FALSE
## Foundation.Slab 59.791667 0.1370802 FALSE TRUE
## Foundation.Stone 242.166667 0.1370802 FALSE TRUE
## Foundation.Wood 485.333333 0.1370802 FALSE TRUE
## BsmtQual 1.050162 0.3427005 FALSE FALSE
## BsmtCond 20.153846 0.3427005 FALSE TRUE
## BsmtExposure 4.312217 0.3427005 FALSE FALSE
## BsmtFinType1 2.410072 0.2056203 FALSE FALSE
## BsmtFinSF1 38.916667 43.5915010 FALSE FALSE
## BsmtFinType2.ALQ 75.789474 0.1370802 FALSE TRUE
## BsmtFinType2.BLQ 43.212121 0.1370802 FALSE TRUE
## BsmtFinType2.GLQ 103.214286 0.1370802 FALSE TRUE
## BsmtFinType2.LwQ 30.717391 0.1370802 FALSE TRUE
## BsmtFinType2.None 37.394737 0.1370802 FALSE TRUE
## BsmtFinType2.Rec 26.018519 0.1370802 FALSE TRUE
## BsmtFinType2.Unf 6.151961 0.1370802 FALSE FALSE
## BsmtFinSF2 258.400000 9.8697738 FALSE TRUE
## BsmtUnfSF 13.111111 53.4612748 FALSE FALSE
## TotalBsmtSF 1.057143 49.3488691 FALSE FALSE
## Heating.Floor 1458.000000 0.1370802 FALSE TRUE
## Heating.GasA 44.593750 0.1370802 FALSE TRUE
## Heating.GasW 80.055556 0.1370802 FALSE TRUE
## Heating.Grav 207.428571 0.1370802 FALSE TRUE
## Heating.OthW 728.500000 0.1370802 FALSE TRUE
## Heating.Wall 363.750000 0.1370802 FALSE TRUE
## HeatingQC 1.728972 0.3427005 FALSE FALSE
## CentralAir 14.357895 0.1370802 FALSE FALSE
## Electrical 14.191489 0.3427005 FALSE FALSE
## X1stFlrSF 1.562500 51.5421522 FALSE FALSE
## X2ndFlrSF 82.900000 28.5126799 FALSE FALSE
## LowQualFinSF 55.115385 0.1370802 FALSE TRUE
## GrLivArea 1.571429 58.9444825 FALSE FALSE
## BsmtFullBath 1.455782 0.2741604 FALSE FALSE
## BsmtHalfBath 17.212500 0.2056203 FALSE FALSE
## FullBath 1.180000 0.2741604 FALSE FALSE
## HalfBath 1.709738 0.2056203 FALSE FALSE
## BedroomAbvGr 2.243017 0.5483208 FALSE FALSE
## KitchenAbvGr 21.400000 0.2741604 FALSE TRUE
## KitchenQual 1.254266 0.2741604 FALSE FALSE
## TotRmsAbvGrd 1.221884 0.8224812 FALSE FALSE
## Functional.Maj1 103.214286 0.1370802 FALSE TRUE
## Functional.Maj2 290.800000 0.1370802 FALSE TRUE
## Functional.Min1 46.064516 0.1370802 FALSE TRUE
## Functional.Min2 41.911765 0.1370802 FALSE TRUE
## Functional.Mod 96.266667 0.1370802 FALSE TRUE
## Functional.Sev 1458.000000 0.1370802 FALSE TRUE
## Functional.Typ 13.590000 0.1370802 FALSE FALSE
## Fireplaces 1.061538 0.2741604 FALSE FALSE
## FireplaceQu 1.820580 0.4112406 FALSE FALSE
## GarageType 2.058700 0.1370802 FALSE FALSE
## GarageYrBlt 1.261538 6.6483893 FALSE FALSE
## GarageFinish 1.433649 0.2741604 FALSE FALSE
## GarageCars 2.230352 0.3427005 FALSE FALSE
## GarageArea 1.653061 30.1576422 FALSE FALSE
## GarageQual 10.053030 0.1370802 FALSE FALSE
## GarageCond 10.861789 0.1370802 FALSE FALSE
## PavedDrive 14.877778 0.2056203 FALSE FALSE
## WoodDeckSF 20.026316 18.7799863 FALSE FALSE
## OpenPorchSF 22.620690 13.7765593 FALSE FALSE
## EnclosedPorch 83.400000 8.2248115 FALSE TRUE
## X3SsnPorch 59.791667 0.1370802 FALSE TRUE
## ScreenPorch 11.577586 0.1370802 FALSE FALSE
## PoolArea 1453.000000 0.4797807 FALSE TRUE
## PoolQC 728.500000 0.1370802 FALSE TRUE
## Fence.GdPrv 23.728814 0.1370802 FALSE TRUE
## Fence.GdWo 26.018519 0.1370802 FALSE TRUE
## Fence.MnPrv 8.292994 0.1370802 FALSE FALSE
## Fence.MnWw 131.636364 0.1370802 FALSE TRUE
## Fence.None 4.192171 0.1370802 FALSE FALSE
## MiscFeature.Gar2 728.500000 0.1370802 FALSE TRUE
## MiscFeature.None 26.018519 0.1370802 FALSE TRUE
## MiscFeature.Othr 728.500000 0.1370802 FALSE TRUE
## MiscFeature.Shed 28.775510 0.1370802 FALSE TRUE
## MiscFeature.TenC 1458.000000 0.1370802 FALSE TRUE
## MiscVal 27.057692 0.1370802 FALSE TRUE
## MoSold.April 9.347518 0.1370802 FALSE FALSE
## MoSold.August 10.959016 0.1370802 FALSE FALSE
## MoSold.December 23.728814 0.1370802 FALSE TRUE
## MoSold.February 27.057692 0.1370802 FALSE TRUE
## MoSold.January 24.596491 0.1370802 FALSE TRUE
## MoSold.July 5.235043 0.1370802 FALSE FALSE
## MoSold.June 4.766798 0.1370802 FALSE FALSE
## MoSold.March 12.764151 0.1370802 FALSE FALSE
## MoSold.May 6.151961 0.1370802 FALSE FALSE
## MoSold.November 17.468354 0.1370802 FALSE FALSE
## MoSold.October 15.393258 0.1370802 FALSE FALSE
## MoSold.September 22.158730 0.1370802 FALSE TRUE
## YrSold 1.027356 0.3427005 FALSE FALSE
## SaleType.COD 32.930233 0.1370802 FALSE TRUE
## SaleType.Con 728.500000 0.1370802 FALSE TRUE
## SaleType.ConLD 161.111111 0.1370802 FALSE TRUE
## SaleType.ConLI 290.800000 0.1370802 FALSE TRUE
## SaleType.ConLw 290.800000 0.1370802 FALSE TRUE
## SaleType.CWD 363.750000 0.1370802 FALSE TRUE
## SaleType.New 11.057851 0.1370802 FALSE FALSE
## SaleType.Oth 485.333333 0.1370802 FALSE TRUE
## SaleType.WD 6.598958 0.1370802 FALSE FALSE
## SaleCondition.Abnorml 13.445545 0.1370802 FALSE FALSE
## SaleCondition.AdjLand 363.750000 0.1370802 FALSE TRUE
## SaleCondition.Alloca 120.583333 0.1370802 FALSE TRUE
## SaleCondition.Family 71.950000 0.1370802 FALSE TRUE
## SaleCondition.Normal 4.590038 0.1370802 FALSE FALSE
## SaleCondition.Partial 10.766129 0.1370802 FALSE FALSE
## SalePrice 1.176471 45.4420836 FALSE FALSE
nzv_cols <- row.names(nzv[!grepl("Neighborhood", row.names(nzv)) & nzv$nzv, ])
if(length(nzv_cols) > 0) {
train <- train[, -which(names(train) %in% nzv_cols)]
}
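To make the `freqRatio` and `percentUnique` columns above easier to read, here is a base-R sketch of how such metrics can be computed for a single column. It is an approximation written for illustration, not caret's own code. The function name `nzv_metrics` is hypothetical, and the thresholds (a frequency ratio above 95/5 and at most 10 percent unique values) are my understanding of the `nearZeroVar` defaults rather than something stated in this report:

```r
# Hypothetical re-implementation of near-zero-variance metrics for one vector.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # ratio of the most common value's count to the second most common value's count
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  # distinct values as a percentage of the number of observations
  pct_unique <- 100 * length(counts) / length(x)
  zero_var <- length(counts) <= 1
  data.frame(freqRatio = freq_ratio,
             percentUnique = pct_unique,
             zeroVar = zero_var,
             nzv = zero_var || (freq_ratio > freq_cut && pct_unique <= unique_cut))
}

rare <- nzv_metrics(c(rep(0, 97), rep(1, 3)))  # dummy that is almost always 0
even <- nzv_metrics(rep(c(0, 1), 50))          # balanced dummy
print(rbind(rare = rare, even = even))
```

Under these assumptions a dummy column is flagged once one of its two values outnumbers the other by more than 19 to 1, which is consistent with the table above: `MSSubClass.1StoryOld` (ratio 20.1) is flagged while `LandSlope.Gtl` (ratio 17.7) is not.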
## identify and remove highly correlated predictors from training set
cor_preds <- cor(train[, -which(names(train) == "SalePrice")])
high_cor <- findCorrelation(cor_preds, cutoff = 0.80)
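(keep <- which(colnames(train) %in%
                 c("GrLivArea", "TotalBsmtSF",
                   "GarageCars", "FireplaceQu")))
## [1] 80 86 96 100
# retain these four predictors even if flagged; skip the drop when nothing is left to remove
high_cor <- high_cor[!high_cor %in% keep]
if (length(high_cor) > 0) {
  train <- train[, -high_cor]
}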
## partition training set for model testing on known sale prices
split <- createDataPartition(train$SalePrice, p = 0.75, list = FALSE)
training <- train[split, ]
testing <- train[-split, ]
mods <- modelLookup()
mods <- mods[mods$forReg == TRUE, ]
fitControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(xgboost)
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
library(doParallel)
## Loading required package: iterators
## Loading required package: parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
## model training and evaluation on partitioned `train` data
err <- data.frame(model = character(0), rmse = numeric(0),
stringsAsFactors = FALSE)
# set.seed(2017)
lm_fit <- train(log(SalePrice) ~ . - Id, data = training,
method = "lm",
preProc = c("center", "scale"),
trControl = fitControl)
lm_fit
## Linear Regression
##
## 1096 samples
## 110 predictor
##
## Pre-processing: centered (109), scaled (109)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 987, 985, 986, 986, 988, 985, ...
## Resampling results:
##
## RMSE Rsquared
## 0.1265347 0.8994076
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
testing$SalePredict1 <- exp(predict(lm_fit, testing, na.action = na.pass))
RMSE(log(testing$SalePredict1), log(testing$SalePrice))
## [1] 0.1084855
sqrt(sum((log(testing$SalePredict1) - log(testing$SalePrice))^2) /
nrow(testing))
## [1] 0.1084855
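The two results above agree because `RMSE()` applied to logged values is the root mean squared difference of the log prices. Writing \(\hat{y}_i\) for `SalePredict1`, \(y_i\) for `SalePrice` and \(n\) for `nrow(testing)` (notation introduced here for illustration), the quantity computed both times is:

```latex
\mathrm{RMSE}_{\log}
  = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log \hat{y}_i - \log y_i \right)^2 }
```

Since the model was fit to `log(SalePrice)`, exponentiating the prediction and then taking the log again simply recovers the model's output on the log scale.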
# list() keeps the rmse column numeric; c() would coerce both values to character
err[nrow(err) + 1, ] <-
  list("lm",
       RMSE(log(testing$SalePredict1),
            log(testing$SalePrice))
  )
summary(lm_fit)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10024 -0.04895 0.00052 0.05953 0.42364
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.0236951 0.0035326 3403.599 < 2e-16 ***
## MSSubClass.1.5StoryFin -0.0015951 0.0049810 -0.320 0.748849
## MSSubClass.1StoryNew 0.0150135 0.0093709 1.602 0.109446
## MSSubClass.1StoryPUD 0.0058897 0.0065264 0.902 0.367038
## MSSubClass.2StoryNew -0.0094810 0.0081829 -1.159 0.246886
## MSZoning.RL 0.0066192 0.0067510 0.980 0.327092
## LotFrontage 0.0010386 0.0051466 0.202 0.840108
## LotArea 0.0414283 0.0067654 6.124 1.32e-09 ***
## Alley.None 0.0023885 0.0044027 0.543 0.587591
## LotShape.IR1 -0.0055124 0.0041212 -1.338 0.181351
## LandContour.Lvl 0.0050235 0.0047689 1.053 0.292424
## LotConfig.Corner 0.0216500 0.0086729 2.496 0.012712 *
## LotConfig.CulDSac 0.0165252 0.0062359 2.650 0.008178 **
## LotConfig.Inside 0.0156417 0.0094183 1.661 0.097074 .
## LandSlope.Gtl -0.0035353 0.0047925 -0.738 0.460881
## Neighborhood.Blmngtn 0.0018639 0.0067879 0.275 0.783692
## Neighborhood.Blueste -0.0013947 0.0040120 -0.348 0.728185
## Neighborhood.BrDale -0.0056862 0.0058493 -0.972 0.331225
## Neighborhood.BrkSide 0.0030127 0.0097935 0.308 0.758439
## Neighborhood.ClearCr 0.0056988 0.0071723 0.795 0.427059
## Neighborhood.CollgCr -0.0068298 0.0151880 -0.450 0.653035
## Neighborhood.Crawfor 0.0178130 0.0096671 1.843 0.065681 .
## Neighborhood.Edwards -0.0203691 0.0121556 -1.676 0.094114 .
## Neighborhood.Gilbert -0.0066493 0.0100084 -0.664 0.506608
## Neighborhood.IDOTRR -0.0220977 0.0080829 -2.734 0.006371 **
## Neighborhood.MeadowV -0.0131950 0.0068287 -1.932 0.053612 .
## Neighborhood.Mitchel -0.0118261 0.0084846 -1.394 0.163683
## Neighborhood.NAmes -0.0086617 0.0172791 -0.501 0.616283
## Neighborhood.NoRidge 0.0122484 0.0083247 1.471 0.141519
## Neighborhood.NPkVill 0.0012465 0.0051260 0.243 0.807920
## Neighborhood.NridgHt 0.0155357 0.0113292 1.371 0.170596
## Neighborhood.NWAmes -0.0059288 0.0106598 -0.556 0.578213
## Neighborhood.OldTown -0.0133798 0.0140679 -0.951 0.341792
## Neighborhood.Sawyer -0.0097005 0.0105643 -0.918 0.358718
## Neighborhood.SawyerW -0.0025334 0.0097870 -0.259 0.795805
## Neighborhood.Somerst 0.0150595 0.0119355 1.262 0.207339
## Neighborhood.StoneBr 0.0150349 0.0066443 2.263 0.023864 *
## Neighborhood.SWISU 0.0020092 0.0074797 0.269 0.788279
## Neighborhood.Timber -0.0054680 0.0081967 -0.667 0.504870
## Neighborhood.Veenker NA NA NA NA
## Condition1.Feedr 0.0002787 0.0047156 0.059 0.952879
## Condition1.Norm 0.0195993 0.0048026 4.081 4.85e-05 ***
## BldgType 0.0082033 0.0052831 1.553 0.120806
## HouseStyle.1Story -0.0005548 0.0105217 -0.053 0.957957
## OverallQual 0.0692656 0.0075947 9.120 < 2e-16 ***
## OverallCond 0.0464160 0.0052461 8.848 < 2e-16 ***
## YearBuilt 0.0474832 0.0123167 3.855 0.000123 ***
## YearRemodAdd 0.0102665 0.0064397 1.594 0.111199
## RoofStyle.Gable -0.0052600 0.0041393 -1.271 0.204121
## Exterior1st.Plywood -0.0081813 0.0060467 -1.353 0.176362
## `\\`Exterior1st.Wd Sdng\\`` -0.0156152 0.0051295 -3.044 0.002395 **
## Exterior2nd.HdBoard -0.0091333 0.0055403 -1.649 0.099566 .
## Exterior2nd.MetalSd -0.0072419 0.0052553 -1.378 0.168509
## Exterior2nd.Plywood -0.0034097 0.0062517 -0.545 0.585602
## Exterior2nd.VinylSd -0.0057625 0.0069226 -0.832 0.405368
## MasVnrType.BrkFace -0.0030663 0.0046645 -0.657 0.511099
## MasVnrType.Stone 0.0010222 0.0048098 0.213 0.831749
## ExterQual -0.0017332 0.0067145 -0.258 0.796366
## ExterCond -0.0010771 0.0041715 -0.258 0.796298
## Foundation.BrkTil -0.0127511 0.0109852 -1.161 0.246022
## Foundation.CBlock -0.0127907 0.0164526 -0.777 0.437094
## Foundation.PConc -0.0026350 0.0174666 -0.151 0.880116
## BsmtQual 0.0081686 0.0074235 1.100 0.271437
## BsmtExposure 0.0196917 0.0050041 3.935 8.90e-05 ***
## BsmtFinType1 0.0031426 0.0058381 0.538 0.590504
## BsmtFinSF1 0.0061309 0.0175544 0.349 0.726974
## BsmtFinType2.Unf 0.0034549 0.0066779 0.517 0.605015
## BsmtUnfSF -0.0222625 0.0184907 -1.204 0.228886
## TotalBsmtSF 0.0428800 0.0173974 2.465 0.013881 *
## HeatingQC 0.0139327 0.0051194 2.722 0.006612 **
## CentralAir 0.0160220 0.0047279 3.389 0.000730 ***
## Electrical -0.0068111 0.0043056 -1.582 0.113986
## X1stFlrSF -0.0093755 0.0160481 -0.584 0.559209
## X2ndFlrSF -0.0021743 0.0160424 -0.136 0.892216
## GrLivArea 0.1263049 0.0189686 6.659 4.58e-11 ***
## BsmtFullBath 0.0208249 0.0056551 3.682 0.000243 ***
## BsmtHalfBath 0.0031598 0.0040411 0.782 0.434458
## FullBath 0.0142209 0.0066098 2.151 0.031681 *
## HalfBath 0.0157826 0.0058762 2.686 0.007356 **
## BedroomAbvGr -0.0045713 0.0061268 -0.746 0.455775
## KitchenQual 0.0157214 0.0061829 2.543 0.011151 *
## TotRmsAbvGrd 0.0114958 0.0082708 1.390 0.164863
## Functional.Typ 0.0210546 0.0041203 5.110 3.87e-07 ***
## Fireplaces 0.0096707 0.0071418 1.354 0.176019
## FireplaceQu 0.0097816 0.0073034 1.339 0.180775
## GarageType -0.0051448 0.0059234 -0.869 0.385302
## GarageYrBlt -0.0090382 0.0096018 -0.941 0.346778
## GarageFinish 0.0027146 0.0059826 0.454 0.650105
## GarageCars 0.0208901 0.0094218 2.217 0.026835 *
## GarageArea 0.0146619 0.0096095 1.526 0.127387
## GarageCond 0.0091355 0.0059961 1.524 0.127937
## PavedDrive 0.0064071 0.0045756 1.400 0.161743
## WoodDeckSF 0.0104103 0.0041593 2.503 0.012478 *
## OpenPorchSF 0.0012621 0.0042135 0.300 0.764591
## ScreenPorch 0.0123850 0.0038121 3.249 0.001198 **
## Fence.MnPrv 0.0045636 0.0053075 0.860 0.390087
## Fence.None 0.0025446 0.0055795 0.456 0.648449
## MoSold.April -0.0001361 0.0045003 -0.030 0.975877
## MoSold.August -0.0006647 0.0043594 -0.152 0.878849
## MoSold.July 0.0042395 0.0048937 0.866 0.386525
## MoSold.June 0.0057991 0.0048867 1.187 0.235622
## MoSold.March 0.0026420 0.0043858 0.602 0.547044
## MoSold.May 0.0115567 0.0047221 2.447 0.014564 *
## MoSold.November -0.0011085 0.0041887 -0.265 0.791340
## MoSold.October -0.0052911 0.0042313 -1.250 0.211426
## YrSold -0.0085618 0.0038305 -2.235 0.025631 *
## SaleType.WD -0.0063754 0.0061477 -1.037 0.299972
## SaleCondition.Abnorml -0.0071029 0.0073715 -0.964 0.335503
## SaleCondition.Normal 0.0280764 0.0094936 2.957 0.003176 **
## SaleCondition.Partial 0.0276435 0.0094632 2.921 0.003567 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.117 on 987 degrees of freedom
## Multiple R-squared: 0.9218, Adjusted R-squared: 0.9133
## F-statistic: 107.8 on 108 and 987 DF, p-value: < 2.2e-16
# set.seed(2017)
# rf_fit <- train(log(SalePrice) ~ . - Id, data = training,
# method = "rf",
# preProc = c("center", "scale"),
# trControl = fitControl)
# rf_fit
# testing$SalePredict2 <- exp(predict(rf_fit, testing, na.action = na.pass))
# RMSE(log(testing$SalePredict2), log(testing$SalePrice))
# sqrt(sum((log(testing$SalePredict2) - log(testing$SalePrice))^2) /
# nrow(testing))
#
# err[nrow(err) + 1, ] <-
# c("rf",
# RMSE(log(testing$SalePredict2),
# log(testing$SalePrice))
# )
# set.seed(2017)
# xgbLin_fit <- train(log(SalePrice) ~ . - Id, data = training,
# method = "xgbLinear",
# preProc = c("center", "scale"),
# trControl = fitControl)
# xgbLin_fit
# testing$SalePredict3 <- exp(predict(xgbLin_fit, testing, na.action = na.pass))
# RMSE(log(testing$SalePredict3), log(testing$SalePrice))
# sqrt(sum((log(testing$SalePredict3) - log(testing$SalePrice))^2) /
# nrow(testing))
#
# err[nrow(err) + 1, ] <-
# c("xgbLin",
# RMSE(log(testing$SalePredict3),
# log(testing$SalePrice))
# )
# set.seed(2017)
# xgbTree_fit <- train(log(SalePrice) ~ . - Id, data = training,
# method = "xgbTree",
# preProc = c("center", "scale"),
# trControl = fitControl)
# xgbTree_fit
# testing$SalePredict4 <- exp(predict(xgbTree_fit, testing, na.action = na.pass))
# RMSE(log(testing$SalePredict4), log(testing$SalePrice))
# sqrt(sum((log(testing$SalePredict4) - log(testing$SalePrice))^2) /
# nrow(testing))
#
# err[nrow(err) + 1, ] <-
# c("xgbTree",
# RMSE(log(testing$SalePredict4),
# log(testing$SalePrice))
# )
# set.seed(2017)
# ridge_fit <- train(log(SalePrice) ~ . - Id, data = training,
# method = "ridge",
# preProc = c("center", "scale"),
# trControl = fitControl)
# ridge_fit
# testing$SalePredict5 <- exp(predict(ridge_fit, testing, na.action = na.pass))
# RMSE(log(testing$SalePredict5), log(testing$SalePrice))
# sqrt(sum((log(testing$SalePredict5) - log(testing$SalePrice))^2) /
# nrow(testing))
#
# err[nrow(err) + 1, ] <-
# c("ridge",
# RMSE(log(testing$SalePredict5),
# log(testing$SalePrice))
# )
# set.seed(2017)
# glmnet_fit <- train(log(SalePrice) ~ . - Id, data = training,
# method = "glmnet",
# preProc = c("center", "scale"),
# trControl = fitControl)
# glmnet_fit
# testing$SalePredict6 <- exp(predict(glmnet_fit, testing, na.action = na.pass))
# RMSE(log(testing$SalePredict6), log(testing$SalePrice))
# sqrt(sum((log(testing$SalePredict6) - log(testing$SalePrice))^2) /
# nrow(testing))
#
# err[nrow(err) + 1, ] <-
# c("glmnet",
# RMSE(log(testing$SalePredict6),
# log(testing$SalePrice))
# )
err[order(err$rmse), ]
## model rmse
## 1 lm 0.108485464227176
## examine correlations across model predictions
# cor(testing[, (ncol(testing)-6):ncol(testing)])
## re-train models on entire training dataset
# set.seed(2017)
lm_full <- train(log(SalePrice) ~ . - Id, data = train,
method = "lm",
preProc = c("center", "scale"),
trControl = fitControl)
lm_full
## Linear Regression
##
## 1459 samples
## 110 predictor
##
## Pre-processing: centered (109), scaled (109)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1313, 1311, 1314, 1313, 1314, 1314, ...
## Resampling results:
##
## RMSE Rsquared
## 0.1202567 0.9099871
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# set.seed(2017)
# rf_full <- train(log(SalePrice) ~ . - Id, data = train,
# method = "rf",
# preProc = c("center", "scale"),
# trControl = fitControl)
# rf_full
# set.seed(2017)
# xgbLin_full <- train(log(SalePrice) ~ . - Id, data = train,
# method = "xgbLinear",
# preProc = c("center", "scale"),
# trControl = fitControl)
# xgbLin_full
# set.seed(2017)
# xgbTree_full <- train(log(SalePrice) ~ . - Id, data = train,
# method = "xgbTree",
# preProc = c("center", "scale"),
# trControl = fitControl)
# xgbTree_full
# set.seed(2017)
# ridge_full <- train(log(SalePrice) ~ . - Id, data = train,
# method = "ridge",
# preProc = c("center", "scale"),
# trControl = fitControl)
# ridge_full
# set.seed(2017)
# glmnet_full <- train(log(SalePrice) ~ . - Id, data = train,
# method = "glmnet",
# preProc = c("center", "scale"),
# trControl = fitControl)
# glmnet_full
stopCluster(cl)
## linear model prediction
test$SalePrice <- exp(predict(lm_full, test, na.action = na.pass))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
test <- test[, which(names(test) %in% names(train))]
## xgbTree prediction
# test$SalePrice <- exp(predict(xgbTree_full, test, na.action = na.pass))
# test <- test[, which(names(test) %in% names(train))]
## combine model predictions by finding mean prediction for each property
# test$SalePrice <- exp(
# rowMeans(data.frame(
# predict(lm_full, test, na.action = na.pass),
# predict(rf_full, test, na.action = na.pass),
# predict(xgbLin_full, test, na.action = na.pass),
# predict(xgbTree_full, test, na.action = na.pass),
# predict(ridge_full, test, na.action = na.pass),
# predict(glmnet_full, test, na.action = na.pass)),
# na.rm = TRUE)
# )
predictions <- data.frame(Id = test$Id, SalePrice = test$SalePrice)
head(predictions)
## Id SalePrice
## 1 1461 119241.4
## 2 1462 162731.6
## 3 1463 178631.7
## 4 1464 198460.5
## 5 1465 195636.8
## 6 1466 169800.0
predictions[is.na(predictions$SalePrice), ]
## [1] Id SalePrice
## <0 rows> (or 0-length row.names)
## save output, change filename as needed
# write.csv(predictions, file = "Submission_052317_lin2.csv", quote = FALSE, row.names = FALSE)
My best public root mean squared error (RMSE) score in Kaggle’s House Prices: Advanced Regression Techniques competition was 0.12179 (user name: janderman, display name: Judd Anderman), the result of my most recent modeling attempt following a few rounds of iteration, error checking, and refinement. In this case, I used only my fitted linear model to predict property SalePrice in the unlabeled test dataset. From my perspective, the relative success of this last submission came from two things: missing data imputation (in most cases the missing data points were in fact meaningful and so were fairly easy to impute) and recoding of the predictor and target variables as seemed appropriate in each case, whether that involved performing log transformations, binary coding, or casting categorical variables as numeric ones.
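As a rough illustration of the kind of preprocessing described above, the following sketch uses a small made-up data frame. The column names mirror the competition data, but the values, the `houses` object, and the `LogSalePrice` column are assumptions for illustration only; this is not the actual preprocessing code used for the submission.

```r
# Toy data: column names follow the competition data, values are invented.
houses <- data.frame(
  Fence      = c("MnPrv", NA, "GdPrv", NA),
  CentralAir = c("Y", "N", "Y", "Y"),
  SalePrice  = c(120000, 95000, 210000, 163000),
  stringsAsFactors = FALSE
)

# Meaningful missingness: an NA in Fence is taken to mean "no fence",
# so it is imputed with an explicit level instead of being dropped.
houses$Fence[is.na(houses$Fence)] <- "None"

# Binary coding of a two-level categorical variable.
houses$CentralAir <- as.integer(houses$CentralAir == "Y")

# Log transformation of the target variable.
houses$LogSalePrice <- log(houses$SalePrice)

houses
```

Treating a missing value as its own level keeps every row available for modeling, which only makes sense when the absence of a value carries information, as assumed here for Fence.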
*Addendum: I was able to achieve a slightly lower RMSE of 0.12033 on the public leaderboard data by averaging the output of several trained models applied to the test dataset, including the linear model I had used previously. This latter approach was significantly more computationally intensive and time-consuming for what appears to be a relatively modest gain in predictive performance. The relevant code appears in the last couple of code chunks above but is commented out; it can also be found in a separate R Markdown file. While I found it productive to partition the training data so that I could evaluate and compare the performance of different models against known sale prices, I did find that retraining my chosen model(s) on the full training dataset produced improved results. Still, my largest gains in RMSE, the competition’s evaluation metric, occurred early on, after more careful examination and deliberate processing and transformation of the supplied training and testing data.
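The averaging step and the log-scale RMSE can be sketched in base R as follows. The helper names `log_rmse` and `average_log_predictions`, the two model columns, and all numbers are hypothetical and chosen only to make the example self-contained; the averaging mirrors the `exp(rowMeans(...))` pattern in the commented-out chunk above rather than reproducing the actual ensemble.

```r
# RMSE between the logs of predicted and actual prices,
# the form of the metric used throughout this analysis.
log_rmse <- function(predicted, actual) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Average per-model predictions on the log scale, then back-transform
# to dollars. Each column holds one model's log-scale predictions.
average_log_predictions <- function(log_preds) {
  exp(rowMeans(log_preds, na.rm = TRUE))
}

# Invented log-scale predictions from two hypothetical models
# for three properties, plus invented "actual" prices.
log_preds <- data.frame(
  model_a = log(c(100000, 200000, 150000)),
  model_b = log(c(110000, 190000, 150000))
)
actual <- c(105000, 195000, 150000)

ensemble <- average_log_predictions(log_preds)
log_rmse(ensemble, actual)
```

Averaging on the log scale and then exponentiating yields the geometric mean of the individual price predictions, which is consistent with a metric computed on log prices.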